Background and Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving the credit card service would cause the bank a loss, so the bank wants to analyze its customer data, identify the customers who will leave the credit card service, and understand their reasons for leaving, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance.

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether the customer is going to churn or not.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

Import packages

Statistical Overview

Data Pre-Processing

EDA Analysis

Treating outliers

Summary variables

Feature Engineering

Data Preparation for Modeling

Checking inverse mapped values/categories.

Creating Dummy Variables

Building the Model

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting that a customer will attrite when the customer actually stays (false positive) - Loss of resources
  2. Predicting that a customer will not attrite when the customer actually leaves (false negative) - Loss of opportunity

Which case is more important?

How can this loss be reduced, i.e., how can False Negatives be reduced?
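Reducing false negatives corresponds to raising recall, defined as TP / (TP + FN), on the attrition class. The sketch below is only an illustration in plain Python: the labels are made-up toy data (1 = attrited, 0 = existing customer), and the helper names `confusion_counts` and `recall` are hypothetical, not taken from the notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # attrition correctly flagged
        elif predicted == positive:
            fp += 1  # case 1: flagged, but the customer stays
        elif actual == positive:
            fn += 1  # case 2: missed attrition
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model caught: TP / (TP + FN)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels (assumption): 1 = attrited customer, 0 = existing customer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Recall:", recall(y_true, y_pred))
```

A model with high accuracy can still have low recall when attrited customers are a small minority, which is one reason to track recall directly.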

Oversampling data using SMOTE

Building the SMOTE Model

Down Sampling the larger class
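As a rough illustration of downsampling, the sketch below randomly drops rows from the larger class until every class has as many rows as the smallest one. It is a plain-Python stand-in, not the notebook's actual code: the function name `downsample_majority`, the toy rows, and the seed are all assumptions made for the example.

```python
import random
from collections import Counter


def downsample_majority(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = min(len(group) for group in by_class.values())

    new_rows, new_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement from larger classes; keep the smallest whole.
        kept = rng.sample(group, target) if len(group) > target else group
        new_rows.extend(kept)
        new_labels.extend([label] * len(kept))
    return new_rows, new_labels


# Toy data (assumption): 8 rows of class 0 and 2 rows of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

X_down, y_down = downsample_majority(rows, labels)
print(Counter(y_down))
```

Resampling of this kind is normally applied to the training split only, so that the validation and test sets keep the original class balance.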

Hyperparameter Tuning

Since

performed well on both the downsampled and upsampled data, tuning will be done on the first two models using the undersampled data. I will not tune XGBoost here because the process is time-consuming; the model will also be analyzed later on the full data, which will provide enough input.
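A minimal sketch of what grid-search tuning on undersampled data could look like with scikit-learn follows. Everything specific here is an assumption for illustration: the data is synthetic (`make_classification`), the names `X_train_un` and `y_train_un` are hypothetical, and the parameter grid is arbitrary rather than the one used in this analysis. Recall is chosen as the scoring metric because the goal stated above is to reduce false negatives.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the undersampled (balanced) training data.
X_train_un, y_train_un = make_classification(
    n_samples=400,
    n_features=8,
    n_informative=4,
    weights=[0.5, 0.5],
    random_state=1,
)

# Illustrative grid only; the values are assumptions, not tuned choices.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring="recall",  # optimize for catching the positive (attrition) class
    cv=5,
)
grid.fit(X_train_un, y_train_un)

print("Best parameters:", grid.best_params_)
print("Best cross-validated recall:", round(grid.best_score_, 3))
```

The fitted `grid.best_estimator_` can then be evaluated on the untouched validation or test data, which should not be resampled.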

Decision Tree tuning with undersampling

Tuning Random Forest

AdaBoost GridSearchCV

RandomizedSearchCV

XGBoost GridSearchCV

Comparing all models

Performance on test set

Pipelines for productionizing the model

Business Insights and Recommendations